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Abstract 

We propose the convex factorization machine (CFM), which is a convex variant of the widely used Fac¬ 
torization Machines (FMs). Specifically, we employ a linear-l-quadratic model and regularize the linear term 
with the ^ 2 -regularizer and the quadratic term with the trace norm regularizer. Then, we formulate the CFM 
optimization as a semidefinite programming problem and propose an efficient optimization procedure with 
Kazan’s algorithm. A key advantage of CFM over existing FMs is that it can find a globally optimal solution, 
while FMs may get a poor locally optimal solution since the objective function of FMs is non-convex. In 
addition, the proposed algorithm is simple yet effective and can be implemented easily. Finally, CFM is a 
general factorization method and can also be used for other factorization problems including multi-view ma¬ 
trix factorization and tensor completion problems. Through synthetic and movielens datasets, we first show 
that the proposed CFM achieves results competitive to FMs. Furthermore, in a toxicogenomics prediction 
task, we show that CFM outperforms a state-of-the-art tensor factorization method. 


1 Introduction 

In recommendation task including movie recommendation and news article recommendation, the data are repre¬ 
sented in a matrix form, A € RYIxKI^ where A is extremely sparse. Matrix factorization (MF), which imputes 
missing entries of a matrix with the low-rank constraint, is widely used in recommendation systems for news rec¬ 
ommendation, protein-protein interaction prediction, transfer learning, social media user modeling, multi-view 
learning, and modeling text document collections, among others [miiiiiaisjiTiiHiiniiin]. 

Recently, a general framework of MF called the factorization machines (FMs) has been proposed [TTl [T^ IT^ . 
FMs are applied to many regression and classification problems, including the display advertising challengqj, 
and they show state-of-the-art performance. The key contribution of the FMs is that they reformulate recom¬ 
mendation problems as regression problems, where the input a; is a feature vector that indicates the k-th user 
and the fc'-th item, and output y is the rating of the user-item pair: 

\u\ \i\ 

= fo • • • 0 ^ 0 •••0 0 •••0 0 • • • oF e 

k-th user fc'-th item 

yi = [A]k,k’- 

Here, d=\U\-\- 1/| is the dimensionality of a:, [A\k,k' is the score of the fc-th user and fc'-th item, and |A| = n is 
the number of non-zero elements. The goal of the FMs is to find a model that predicts y given an input x. 

^ https://www.haggle.com/c/criteo-display-ad-challenge 
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For FMs, the following linear + feature interaction model is employed: 

d d 

f{x;w,G)=wo + WQX + '^ ^ gjgi'xtxt, 

^=l i’=t+i 

where wq GR, Wq G R‘^, and G = [gi,... ,gm\ G are model parameters (m <C d). Since only the fc-th user 

and fc'-th item element of the input vector x is non-zero, the model can also be written as 

Ak,k' = Wo + [tUo]fe + [i^oJic/i+fe' + gkffjUI+kG 

which is equivalent to the matrix factorization model with global, user, and item biases. Moreover, since FMs 
solve the matrix completion problem through regression, it is easy to utilize side information such as about user’s 
and article’s meta information by simply concatenating the meta-information to x. 

For regression problems, the model parameters are estimated by solving the following optimization problem: 

n 

min y^(gi-f(x,;wo,wo,G)f 

Wo,'W,Cr ^ 

+ Ai|ko ||2 + A 2 ||tr;o|l 2 + A3||G||^, 

where the Ai, A 2 , and A 3 are regularization parameters, and IIGHif is the Frobenius norm. In m, stochastic 
gradient descent (SGD), alternating least squares (ALS), and Markov Chain Monte Carlo (MCMC) based ap¬ 
proaches were proposed. These optimization approaches work well in practice if regularization parameters and 
the initial solution of parameters are set appropriately. However, since the loss function is non-convex with 
respect to G, it can converge to a poor local optimum (mode). The MCMC-based approach tends to obtain a 
better solution than ALS and SGD. However, it requires running the sampler long enough to explore different 
local modes. 

In this paper, we propose the convex factorization machine (CFM). We employ the linear-l-quadratic model, 
Eq. dl]) and estimate w and W such that the squared loss between the output y and the model prediction is 
minimized. More specifically, we regularize the linear parameter w with the £ 2 -regularizer and the quadratic 
parameter W with the trace norm regularizer. Then, we formulate the CFM optimization problem as a semidef- 
inite programming problem and solve it with Hazan’s algorithm |I4) , which is a Frank-Wolfe algorithm |151116] . 
A key advantage of the proposed method over existing FMs is that CFM can find a globally optimal solution, 
while FM can get poor locally optimal solutions. Moreover, our proposed CFM framework is a general variant 
of convex matrix factorization with nuclear norm regularization, and the CFM algorithm is simple and can be 
implemented easily. Finally, since CFM is a general factorization framework, it can be easily applied to any 
factorization problems including multi-view factorization problems m- We demostrate the effectiveness of the 
proposed method first through synthetic and real-world datasets. Then, we show that the proposed method 
outperforms a state-of-the-art multi-view factorization method on toxicogenomics data. 

Contribution: The contributions of this paper are summarized below: 

• We formulate the FM problem as a semidefinite programming problem, which is a convex formulation. 

• We show that the proposed CFM framework includes the matrix factorization with nuclear norm regular¬ 
ization [TB] as a special case. 

• We formulate a Tucker based tensor completion problem mBHilEI] as a CFM problem. Thanks to the 
formulation, we can naturally handle large-scale sparse tensor completion problems. To our knowledge, 
this is the first work. 

• We propose a simple yet efficient optimization procedure for the semidefinite programming problem using 
Hazan’s algorithm El- 

• We applied the proposed CFM for a toxicogenomics prediction task; it outperformed a state-of-the-art 
method. 
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2 Proposed Method 

In this section, we propose the convex factorization machine (CFM) for regression problems. 

2.1 Problem Formulation 

We suppose that we are given n independent and identically distributed (i.i.d.) paired samples {(xi^yi) \ Xi S 
‘ 1 ’, Vi ^ y, i = ^, ■ ■ ■ ,n} drawn from a joint distribution with density p{x, y). We denote X = [xi, ..., a;„] £ 
l^dxn input data and y = [yi ,..., j/n]^ G R" as the output real-valued vector. 

The goal of this paper is to find a model that predicts y given an input x. 

2.2 Model 

We employ the following model: 


f{x;w,W)=wo + wJx + '^ ^ 

£=1 £'=i+l 

= z + ^ir[W{xx^ — (iia,g{x o x))), 


( 1 ) 


where z = [I w = [wq W £ is a positive semi-definite matrix, tr(X) is the 

trace operator, o is the elementwise product, and diag(a;) £ is the diagonal matrix. The difference between 
the FMs model and Eq. o is that gjgk' is parametrized as Wk,k'- 
The model can equivalently be written as 


f{x\w,W) = [m''' vec(FF)^] 


z 

ivec(a;a;^ —diag(a; o a;)) ’ 


where vec(FF) £ R"^ is the vectorization operator. Since the model is a linear model, the optimization problem 
is jointly convex with respect to both w and W if we employ a loss function such as squared loss and logistic 
loss. 


2.3 Optimization problem 

We formulate the optimization problem of CFM as a semidefinite programming problem: 

min \\y-f{X-w,W)\\l + Xi\\w\\l 

IP, Vv 

s.t. W^Oand ||tF||tr = ?7, (2) 

where 

f{X- w, W) = [f{x,;w, W),..., f{x„; w, W)]^ £ R”, 
and Ai > 0 and rj > 0 are regularizaiton parameters. ||W||tr is the trace norm defined as 

d 

||W||tr = tr(VTFTFF) 

where ai is the f-th singular value of W. The trace norm is also referred to as the nuclear norm [22) . Since the 
singular values are non-negative, the trace norm can be regarded as the £i norm on singular values. Thus, by 
imposing the trace norm, we can make W to be low-rank. 
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To derive a simple yet effective optimization algorithm, we first eliminate w from the optimization problem 
Eq.Q and convert the problem to a convex optimization problem with respect to W. Specifically, we take the 
derivative of the objective function with respect to w and obtain an analytical solution for w: 

w* = {ZZ^ + XiIa+i)-^Z{y - /q(X; W)), 

where 

/q(X; W) = [/q(®i; PE),G R", 

/q(®; W) = itr(PE {xx^ - diag(a; ox))), 

is the model corresponding to the quadratic term of f{x-,w,W) such that f{x-,w,W) = z + fq^x^W), 
Z = [ 21 ,..., z„] G Id+i G R<^+ixd+i is the identity matrix. Note that, w* depends on the unknown 

parameter PE. 

Plugging w* back into the objective function of Eq.Q, we can rewrite the objective function as 

min J{W) s.t. PE ^ 0 and ||EE||tr = ?7, (3) 

w 

where 

J{W) = {y- fqiX; W)yC{y - /q(X; TE)), 

C = RJR + R=In- Z^{ZZ^ + Aild+ij-^Z, and H = {ZZ^ + XiId+i)-^Z. 

Once PE is obtained by solving Eq. we can get the estimated linear parameter w as 

w = {ZZ^ + XiId+i)-^Z{y - fqiX; TE)). 


Relation to Matrix Factorization with Nuclear Norm Regularization: The constraint on PE can be 

written as 


PE = 


UU^ M 

VV^ 


h 0, tr(1717^) + tr(EE^) = y, 


where U G V G and M = UV^ G Eurthermore, for the GEM setting, the fc-th user 

and fc'-th item rating is modeled as 

= wq + [n!o]fc + [u^o]|c/|+fc' + [M]k,k'- 
Lemma 1 jil8f For any non-zero matrix M G R'^^" and y: 




iff 3 symmetric matrices G G and H G R"^*^ 


PE = 


G M 
H 


F 0, tr(G) + tr(Ff) = y. 


Based on Lemma 1, the optimization problem Eq. @ is equivalent to 


where 


min J{w,M) + Ai||u ;||2 s.t. ||M||tr < -, 
w.M 2 


J{w,M) := ^ {[A]k,k' - Wo - [nJo]fe - ['i«o]|c/|+fc' - [M]k,k'f, 

{k,k')^Q, 


( 4 ) 
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Algorithm 1 CFM with Kazan’s Algorithm 

Rescale loss function Jrj{W) = \\R{y — /q(X;? 7 W)|||. 

Initialize 

for all t = 0,1..., T = do 

Compute = approxEV( — VJn{W^*'>), 

at := (or at = argmin^, + a{p^*'>— W^*^))) 

^(*+1) = W(*) + St(p(*)pW^ - w(*)) 

end for 
return 


and n is the set of observed values in A. If we set lu = 0, the optimization problem is equivalent to matrix 
factorization with nuclear norm regularization [18) : CFM includes convex matrix factorization as a special case. 
Since we would like to have a low-rank matrix M of the user-item matrix A for recommendation, Eq. m is a 
natural formulation for convex EMs. Note that, even though CEM resembles the matrix factorization [TB]. the 
ME method cannot incorporate side information, while CFM can deal with side-information by concatenating it 
to vector x. That is, intrinsically, the MF method [18] and CFM are different. 

2.4 Kazan’s Algorithm 

For optimizing W, we adopt Kazan’s algorithm M- It only needs to compute a leading eigenvector of a sparse 
d X d matrix in each iteration, and thus it scales well to large problems. Moreover, the proposed CFM update 
formula is extremely simple, and hence useful for practitioners. The Kazan’s algorithm for CFM is summarized 
in Algorithm [TJ 

Derivative computation: The objective function J(W) can be equivalently written as 

n n 

- fQ{xf,W)){y, - fQ{x,-W)). 

i=l 3 = 1 


Then, is given as 

VJ(W(‘)) = 

n n 

= -diag(^ Ci 3 {y 3 - fQ{xp,W )),..., ^ - fgix^; FF)), 

1=1 1=1 

where we use ^ = xx^. Since the derivative is written as X.D'd), the eigenvalue decomposition can 

be obtained without storing in memory. Moreover, since the matrix A is a sparse matrix, we can 

efficiently obtain the leading eigenvector by the Lanczos method. We can use a standard eigenvalue decomposition 
package to compute the approximate eigenvector by the ’’approxEV” function. For example in Matlab, we can 
obtain the approximate eigenvector p by the function [p, Z] = eigs(—V 1,' LA', Options.tol= ), 

where I is the corresponding eigenvalue. 

The proposed CFM optimization requires a matrix inversion (i.e., 0{n^)) for computing C in and 

it is not feasible if the dimensionality d is large. For example in user-item recommendation task, the total 
dimensionality of the input can be the number of users + the number of items. In such cases, the dimensionality 
can be 10® or more. Kowever, fortunately, the input matrix X is extremely sparse, and we can efficiently compute 
JDW by using a conjugate gradient method whose time complexity is 0{n). 
can be written as 


= diag(C'y(‘^) = dmg{{R^ R+ XiH^ 
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where = y — Since the number of samples n tends to be larger than the dimensionality 

d in factorization machine settings, ZZ^ becomes full rank. Namely, we can safely make the regularization 
parameter Ai = 0. In such case, is given as 

=diag(y(‘)-Z^ 

where we use C = RJR = I — Z^{ZZ^)~^Z. The = [ZZ^)~^Zy^*‘^ is obtained by solving 

(5) 

where can be efficiently obtained by a conjugate gradient method with time complexity 0(n). Thus, we can 
compute without computing the matrix inverse To further speed up conjugate gradient method, 

we use a preconditioner and the previous solution as the initial solution. 

Finally, we compute as 

= diag(y - /(X; 


The diagonal elements of are the differences between the observed outputs and the model predictions at 
the t-th iteration. Note that, in our CFM optimization, we eliminate w and only optimize for W; however, the 
w is implicitly estimated in Hazan’s algorithm. 

Complexity: Iteration t in Algorithm [1] includes computing an approximate leading eigenvector of a sparse 
matrix with n non-zero elements and an estimation of w, which require 0{n) computation using Lanczos al¬ 
gorithm and 0{n) computaiton using conjugate gradient descent, respectively. Thus, the entire computational 
complexity of the proposed method is 0{Tn), where T is the total number of iterations in Hazan’s algorithm. 


Optimal step size estimation: Hazan’s algorithm assures W converges to a global optimum with using the 
step size at = 2/(2 + t),t = 0,1,... ,T [18]. However, this is in practice slow to converge. Instead, we choose 
the ak that maximally decreases the objective function J{W). The optimal ak can be obtained by solving the 
following equation: 


at=argmin 


= argmin 


R{y-fQ{X-,{l-a)W^*^ 


2 

2 


Taking the derivative with respect to a and solving the problem for a, we have 

^ (y - /q(X; 

\\R{fQiX-,pWpW^ -w(n\i 

The computation of at involves the matrix inversion of R. However, by using the same technique as in the 
derivative computation, we can efficiently compute at. 

Update W: When the input dimension d is large, storing the feature-feature interaction matrix W is not 
possible. To avoid the memory problem, we update as 

W(‘+i) = p(*+i)diag(A(‘+i))p(*+i)^, 

p{t+l) _ jp(t) p(i)j g ]^dx(t+l)^ 

x(‘+i) _ / (l-St)A(‘^ (fc < t) 

\ Si (k = t + l) ’ 

where Thus, we only need to store pS+i) g ^^><(*+1) and A^+i) at the (t + l)-th iteration. In 

practice, Hazan’s algorithm converges with t = 100 (see experiment section), so the required memory for Hazan’s 
algorithm is reasonable. 
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Prediction: Let us define U = Pdiag(A)^/^ G such that W = UU^. Then, we can efficiently compute 

the output as 


f{x; w, W) = z + i (||i7^a:||2 — {xo xY {U o Lf)l) . 
The time complexities of the terms are 0(d), 0{d{t + 1)), and 0{d), respectively. 


2.5 Tensor completion with CFM 

In this section, we formulate a Tucker based tensor completion problem [20l [21] as a CFM problem. 

Let us denote the input 3-way tensor as 3^ G ]^nixn 2 xri 3 ^ where ni, n 2 , and na are the number of samples in 
each mode. In this paper, we consider the following regularization based learning model: 

3 \ ^ 3 

m—1 / m—1 


mm 



where G jg pjjg^g tensor, G jg the m-th mode tensor, A > 0 is the 

regularization parameter, is the unfolded matrix with respect to the m-th mode, G ]^"ixn 2 n 3 ^ 

-3^(3) € R"3xnira2^ ^The final goal of this paper is to learn M from y by minimizing 
J{A4). Now, we reformulate Eq. (jT]) by CFM. 

Let us define the pooled matrix: 


Gi 


W = 


G2 Mg] 


Ga 


M, 


( 3 ) 

( 3 ) 


M, 


( 3 ) 


( 3 ) 

Ha 


G 0, 


where d = X]m=i + Sm=i Sm'=m+i Note, the off-diagonal matrices are not important for deriving 

optimization algorithm, and thus, we omit them here. Moreover, since the matrix W is a positive semi-definite 
matrix, we can decompose it as 


W = 


Ua 

Ti 

Ua 

Va 


c/r 


u' V' 


Lemma 2 \2,‘^ For a 3-way tensor case, we have: 
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Then, we can rewrite X]m=i 


3 1 

= ^tr {W{x,j^kxJj^k - dmg{xij^k o a^ij.fc))) , 


X 


Sfe = [0 ••■0 0 •••0 0 •••0 


( 2 ) 




0 --- 0 ]^ e ]^ni+n2n3^ 


n 2 (fc-l)+j 

nina 


= fo •••0 1 0 •••0 0 •••0 


e ]^n2+mn3 


J 

ns 


713 ( 2 - 1 )+^ 

nin 2 


= fo • • • 0 1 0 •••0 0 •••0 1 o~~~o]^ e R”"+”i"2, 


r (1) ^ (2) 3^ (3) 3^iT ^ 

= [xllk ^ij,k XiJ,k \ e 


niU-l)+i 

hd 


For the bias tensor Ai , we parametrize it as 


Xi j]^ — 


Wi 

0n2n3 

W2 

0nin3 

Ws 

^n\n2 




( 8 ) 


where w € M“, Wi € W 2 € and W 3 € R"^. Note, we use this parameterization, since the number of 
dimension d can be much bigger than the number of non-zero elements n and it is hard to solve Eq. 

Lemma 3 For the matrices G ^ G R"3^"i"2 




(m) 


Itr < 2 




w = 


Gi M, 


( 1 ) 


M, 


( 1 ) 

( 1 ) 


( 1 ) 

Hi 


r(2) 

'( 2 ) 


M, 


( 2 ) 

( 2 ) 


H. 


G 3 


M. 


( 3 ) 

( 3 ) 


H, 


hO, 


Lm=i tr(Gm) -f tr{Hm) = V- 

Proof: This is a variation of the Lemmal of m- From the characterization: 


Eii“, 


("*)|l _ 

(m) lit'' “ 


m—1 




y] (iiJ7„ii^ + iiK„ii^) 


m—1 













we have that 3 Um, Vm, UmVm = = li 2, 3 s.t. 


2 E ll^wlltr = E W^rnfp + W^^rnfp = E C/^) + tr ) < rj. 


m—1 


That is, we have 


W = 


m—1 


C/iiJT Mg 
^T 


m—1 


M, 


( 1 ) 

( 1 ) 


ViVi 


T 


U2UJ M, 


( 2 ) 

( 2 ) 


M, 


( 2 ) 

( 2 ) 


V 2 V 2 ' 


U.UJ M, 




(3) 

(3) 


^ 3 ^ 3 ^ J 


where tr(VF) < 77 and W >zO. If s = tr(M^) < 77, we can add {t — s)eiej to UiUi , and we have tr(VF) = 77. 
If the matrix is symmetric and positive semi-definite, we can decompose W as 


W = 


Ui 

Vi 

c/3 

V 3 


Ul vr 




such that UmV^ =M{g, 77 i= 1,2,3 arid 77 = MC/^C/^) + tr(y„K[) = EEi l|C/rn||?. + ||Kn|l|. □ 

Based on the Lemma O we can rewrite the optimization problem as 


min E] {iy]i,j,k - - tr - dia,g{xij^k o a?i,j,fe))))" 

{i,j,k)£Q 

s.t ^ 0 , tr(W) = 77 . 


Since this is a CFM problem, we can efficiently solve it with Kazan’s algorithm. 


3 Related Work 

First of all, the same problem setting as in our work has been addressed quite recently [24], being independent 
of our work. The key difference between the proposed method and |24| is that our approach is based on a 
single convex optimization problem for the interaction term W. The approach |24] uses a block-coordinate 
descent (BCD) algorithm for optimization, optimizing the linear and quadratic terms alternatively. That is, 
they alternatingly solve the following two update equations until convergence: 

= argmin \\y - f{X;w,W^*'>)\\l -b Xi\\w\\l, 

W 

= argmin \\y - f{X;w(*+^\W)\\l + X 2 \\W\\tr, 

w 

while our proposed approach is simply given as 

W = argmin \\R{y - fQ{X-,W))\\ls.t. ||W||tr = 77 . 
wyo 
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Hence, the BCD algorithm needs to iterate the sub-problem for W until convergence for obtaining the globally 
optimal solution. 

Let us employ an 0(n) algorithm for the trace norm minimization in BCD; then the entire complexity is 
0{T'Tn) where T' is the BCD iteration and T is the iteration of the sub-problem. On the other hand, our 
algorithm’s complexity is 0{Tn). Another difference is that our optimization approach includes the matrix 
factorization with nuclear norm regularization as a special case, while it is unclear whether the same holds for 
the formulation [24]. Finally, our CFM approach is very easy to implement; the core part of the proposed 
algorithm can be written within 20 lines in Matlab. Note, the BCD based approach is more general than our 
CFM framework; it can be used for other loss functions such as logistic loss and it does not require the positive 
definiteness condition for W. 

The convex variant of matrix factorization has been widely studied in machine learning community [2511261 
nzmi m [2011301131]. The key idea of the convex approach is to use the trace norm regularizer, and the 
optimization problem is given as 

M = argmin \\Vn{A) - Vn{M)\\% + X\\M\Ur, (9) 

M 

where D is the set of observed value in A, [PQ,{A)\ij = [A\i^j if i,j G ft and 0 otherwise, and || ■ ||f is the 
Frobenius norm. Since Eq.([9l) and Eq.(|4]) are equivalent when w = 0, the convex matrix factorization can be 
regarded as a special case of CFM. 

To optimize Eq. the singular value thresholding (SVT) method has been proposed [311|33], where SVT 
converges faster in 0(-^) (e is an approximate error). However, the SVT approach requires to solve the full 
eigenvalue decomposition, which is computationally expensive for large datasets. To deal with large data, Frank- 
Wolfe based approaches have been proposed including Hazan’s algorithm [18] , corrective refitting [^ , and active 
subspace selection [35]. However, these approaches cannot incorporate user and item bias. Furthermore, it is not 
straightforward to incorporate side information to deal with cold start problems (i.e., recommending an item to 
a user who has no click information). 

To handle cold start problems, collective matrix factorization (collective MF) has been proposed [35]. The 
key idea of collective MF is to incorporate side information into matrix factorization. More specifically, we 
prepare a user x user meta matrix (e.g., gender, age, etc.) and an item x item meta matrix (item category, item 
title, etc) in addition to a user-item matrix. Then, we factorize all the matrices together. A convex variant of 
CMF called convex collective matrix factorization (CCMF) has been proposed [37]. CCMF employs the convex 
collective norm, which is a generalization of the trace norm to several matrices. Recently, Hazan’s algorithm 
was introduced to CCMF [9]. More importantly, it has been theoretically justihed that CCMF can give better 
performance in cold start settings. Since FMs can incorporate side information, FMs and CCMF are closely 
related. Actually, CFM can utilize side information and can learn the user and item bias term together; it can 
be regarded as a generalized variant of CCMF. 

4 Experiments 

We evaluate the proposed method on one synthetic dataset, Movielens data (single matrix), and toxicogenomics 
data (two-view tensor). 

In this paper, we compare CFM with ridge regression, FM (SGD), FM (MCMC) and FM (ALS), where FM 
(MCMC) is a state-of-the-art FM optimization method. The ridge regression corresponds to the factorization 
machine with only the linear term f{x) = wq + x, which is also a strong baseline. To estimate FM models, 
we use the publicly available libFM packagcH. For all experiments, the number of latent dimensions of FMs is set 
to 20, which performs well in practice. For FM (ALS), we experimentally set the regularization parameters as 
Ai = 0 and A 2 = 0.01. The initial matrices W (for CFM) and G (for FMs) are randomly set (this is the default 
setting of the libFM package). For CFM, we implemented the algorithm with Matlab. We experimentally set 

^ http;//www.libfm.org 
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Figure 1: Convergence of the methods: test RMSE of the synthetic experiment. CFM is the proposed convex 
factorization machine, FM (ALS) is the factorization machine with ALS optimization, and FM (MCMC) is the 
factorization machine with MCMC optimization. 


C/ = 1, and it works for our experiments. For all experiments, we use a server with 16 core 1.6GHz CPU and 
24G memory. 

When evaluating the performance of CFM and FMs, we use the root mean squared error (RMSE): 

^test 

where y* and y are the true and estimated target values, respectively. 

4.1 Synthetic Experiments 

First, we illustrate how the proposed CFM behaves using a synthetic dataset. 

In this experiment, we randomly generate input vectors x G as a; ^ A/'(0, J), and output values as 

d d 

y = wo + w^x + '^ ^ 

where 

Wo ^ A/'(0,1), w ~ Af{0, /), Wg^i! ^ Uniform([0 1]). 

We use 900 samples for training and 100 samples for testing. We run the experiments 5 times with randomly 
selecting training and test samples and report the average RMSE scores. Figure [1] shows the test RMSE for CFM 
and FMs. As can be seen, the proposed CFM gets the lowest RMSE values with a small number of iterations, 
while FMs needs many iterations to obtain reasonable performance. 

4.2 Recommendation Experiments 

Next, we evaluate our proposed method on the Movielens lOOK, IM, lOM, and 20M datasets [3S] (Tabled] for 
dataset details). In these experiments, we randomly split the observations into 75% for training and 25% for 


\ 


“^test 
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Figure 2: RMSE over iteration (t). at = j train and at = j test are training and test RMSE with using 
stepsize. Optimal at train and at test are training and test RMSE with using the optimal stepsize Eq. dsn¬ 


testing. We rnn the recommendation experiments on three random splits, which is the same experimental setting 
as in [24], and report the average RMSE score. 

Eor GEM, the regularization parameter ry is experimentally set to 2000 (for lOOK), 4000 (for IM), 20000 (for 
lOM), and 40000 (for 20M), respectively. Eor EMs, the rank is set to 20, which gives overall good performance. To 
investigate the effect of the initialization parameter, we initialize EM (MCMC) with two parameters stdev = 0.05 
and stdev = 0.1, which are the standard deviation of the random variable for initializing G. We also report the 
RMSE of the GEM method of [ 21 ] for reference. 

Figure 12] shows the training and test RMSE with the GEM (optimal step size) and the GFM {at = 5 ^) for 
the Movielens datasets. Eor both methods, the RMSE of training and test is converging with a small number 
of iterations. Overall, the optimal step size based approach converges faster than the one based on at = 2 ^- 
Figure m shows the RMSE over computational time (seconds). Eor large datasets, the GFM achieves reasonable 
performance in less than an hour. In Table |21 we show the RMSE comparison of the proposed GFM with FMs. 
As we expected, GFM compares favorably with FM (SGD) and FM (ALS), since FM (SGD) and FM (ALS) can 
be easily trapped at poor locally optimal solutions. Moreover, our GFM method compares favorably with also 
the GFM (BGD) [ 21 ]. On the other hand, FM (MGMG) can obtain better performance than GFMs (both our 
formulation and [ 21 ]) for these datasets if we set an appropriate initialization parameter. This is because MGMG 
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(a) Movielens lOOK data. 


(b) Movielens IM data. 
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(d) Movielens 20M data. 


Figure 3: RMSE over time (seconds), at = j test is the test RMSE with using stepsize. Optimal at test is 
the test RMSE with using optimal stepsize. 


tends to avoid poor locally optimal solution if we run the sampler long enough. That is, since the objective 
function of FMs is non-convex and it has more flexibility than the convex formulation, it can converge to a 
better solution than CFM if we initialize FMs well. 

4.3 Prediction in Toxicogenomics 

Next, we evaluated our proposed method on a toxicogenomics dataset m- The dataset contains three sets 
of matrices representing gene expression and toxicity responses of a set of drugs. The first set called Gene 
Expression, represents the differential expression of 1106 genes in three different cancer types, to a collection 
of 78 drugs (i.e., G = 1,2,3). The second set. Toxicity, contains three dose-dependent toxicity 

profiles of the corresponding 78 drugs over the three cancers (i.e., G / = 1, 2, 3). The gene expression 

data of the three cancers (Blood, Breast and Prostate) comes from the Connectivity Map [39] and were processed 
to obtain differential expression of treatment vs control. As a result, the expression scores represent positive or 
negative regulation with respect to the untreated level. The toxicity screening data, from the NCT60 database 
[40] . summarizes the toxicity of drug treatments in three variables G150, LC50, and TGI, representing the 50% 
growth inhibition, 50% lethal concentration, and total growth inhibition levels. The data were conformed to 
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Table 1: Movielens datasets. 


Dataset 

\u\ 

|/| 

d 

|A| =n 

Movielens lOOK 

943 

1,682 

2,625 

100,000 

Movielens IM 

6,040 

3,900 

9,940 

1,000,209 

Movielens lOM 

82,248 

10,681 

92,929 

10,000,054 

Movielens 20M 

138,493 

27,278 

165,771 

20,000,263 


Table 2: Test RMSE of CFM, CFM (BCD), FMs, and ridge regression for Movielens data sets. 


Dataset 

GFM 

GFM (BCD) 

FMsgd 

FMals 



Ridge 

lOOK 

0.915 

0.93 

1.078 

1.242 

0.905 

0.901 

0.936 

IM 

0.866 

0.85 

0.943 

0.981 

0.877 

0.846 

0.899 

lOM 

0.810 

0.82 

0.827 

0.873 

0.831 

0.778 

0.855 

20M 

0.802 

n/a 

0.821 

0.852 

0.803 

0.768 

0.850 


represent dose-dependent toxicity profiles for the doses used in the corresponding gene expression dataset. 

Predicting both gene and toxicity matrices: We compared our proposed method with existing state-of-the- 
art methods. In this experiment, we randomly split the observations into 50% for training (129,466 elements) 
and 50% for testing (129,465 elements), which is the exactly same datasets used in [T7]. We run the prediction 
experiments on 100 random splits HZ], and report the average relative MSE score, which is defined as 


where y*’" is the target score vector, ^ is the estimated score vector, and y*’" is the mean of elements in y*’”, 
V is the number of views. In this experiment, the number of views is F = 2. Since the number of elements in 
view 1 and view 2 are different, the relative MSF score is more suitable than the root MSF score. We compare 
our proposed method with ARDCP [41], CP [42], Group Factor Analysis (GFA) [43], and Bayesian Multi-view 
Tensor Factorization (BMTF) [TT]. BMTF is a state-of-the-art multi-view factorization method. 

For GFM, we first concatenate all view matrices as 




to 

A(2) 

^3 

A(2) 

^3 




and use this matrix for learning. The regularization parameter y is experimentally set to 1000. To deal with 
multi-view data, we form the input and output of GFM as 


3327 


78 


X = (0 •••0 ^ 0---0 0 •••0 ^ 0---0 e 

z-th gene i-th drug 1st view 

y = 

Table 13] shows the average relative MSF of the methods. As can be seen, the proposed method outperforms 
the state-of-the-art methods. 


Predicting toxicity matrices using Gene expression data: We further evaluated the proposed GFM on 
the toxicity prediction task. For this experiment, we randomly split the observations of the toxicity matrices into 
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Table 3: Test relative MSE of both gene expression and toxicogenomics data. 



CFM 

Mult 

BMTF 

i-view 

GFA 

ARDCP 

CP 

Single-view 
ARDCP 

CP 

Mean 

StdError 

0.4037 

0.0163 

0.4811 

0.0061 

0.5223 

0.0041 

0.8919 

0.0027 

5.3713 

0.0310 

0.6438 

0.0047 

5.0699 

0.0282 


Table 4: Test relative MSE on toxicogenomics data. 



CFM 

CFM {+ 
TO = 5 

mean/std 
TO = 10 

features) 
TO = 15 

CFM 

TO = 5 

(-l-mean fi 
TO = 10 

sature) 

TO = 15 

Mean 

StdError 

0.5624 

0.0501 

0.5199 

0.0464 

0.5207 

0.0451 

0.5215 

0.0450 

0.5269 

0.0466 

0.5234 

0.0454 

0.5231 

0.0450 


50% for training (341 elements) and 50% for testing (341 elements). Then, we used the gene expression matrices 
Ai, A 2 , A 3 as side information for predicting the toxicity matrices. More specifically, we designed two types of 
features from the gene expression data: 

• Mean of m- nearest neighbor similarities: (xmean) We first find the m-nearest neighbors of the i-th 
drug target, where the Gaussian kernel is used for similarity computation. Then, we average the similarity 
of 1 ,..., m-th nearest neighbors. 

• Standard deviation of to- nearest neighbor similarities: (igtd) Similarly to the mean feature, we first 
found TO-nearest neighbor similarities and then computed that’s standard deviation. 

Then, we used these features as 

78 3 3 2 

a; = f0 •••0 ^ 0 •••0 6 ^ 0 0 ^ 0 

2 -th drug k-th sensitivity Z-th cancer type 



We run the prediction experiments on 100 random splits, and report the average RMSE score (Table S]). 
‘CFM’ is ‘GEM without any additional features. It is clear that the performance of GEM improves by simply 
adding manually designed features. Thus, we can improve the prediction performance of CFM by designing new 
features, and it is useful for various prediction tasks in biology data. 


5 Conclusion 

We proposed the convex factorization machine (CFM), which is a convex variant of factorization machines 
(FMs). Specifically, we formulated the CFM optimization problem as a semidefinite program (SDP) and solved 
it with Kazan’s algorithm. A key advantage of the proposed method over FMs is that CFM can find a globally 
optimal solution, while FMs can get poor locally optimal solutions since they are non-convex approaches. The 
derived algorithm is simple and can be easily implemented. We also showed the connections between CFM 
and convex factorization methods and CFM and convex tensor completion methods. Through synthetic and 
real-world experiments, we showed that the proposed CFM achieves results competitive with state-of-the-art 
methods. Moreover, for a toxicogenomics prediction task, CFM outperformed a state-of-the-art multi-view 
tensor factorization method. 

In future work, we will extend the proposed method to distributed computation. Another important challenge 
is to improve the convergence properties of the proposed method. 
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